[code_review] Misc improvements part 4 #5588
Merged
suhaibmujahid merged 16 commits into mozilla:master on Jan 12, 2026
We can have better tracking with W&B Weave
Pull request overview
This PR refactors the code review evaluation infrastructure by replacing old script-based evaluation tools with a more modular architecture and W&B Weave integration for tracking evaluations.
Changes:
- Removes legacy evaluation scripts (`code_review_tool_evaluator.py`, `code_review_tool_evaluator_report.py`) and experimental files
- Introduces new modular tools for patch summarization, suggestion filtering, and comment matching
- Adds Jupyter notebooks for dataset creation and evaluation using W&B Weave
- Refactors `CodeReviewTool` to use Protocol-based dependency injection for better testability
- Updates platform base classes to accept both `str` and `int` for `patch_id` parameters
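To illustrate the Protocol-based dependency injection mentioned in the list above, here is a minimal sketch. It is not the PR's actual code: `PatchSummarizer`, `SuggestionFilterer`, `ReviewTool`, and the fake implementations are hypothetical names chosen for illustration, and the suggestion list is a placeholder for what an LLM would produce.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class PatchSummarizer(Protocol):
    """Any object with a matching `run` method satisfies this protocol."""

    def run(self, patch_text: str) -> str: ...


class SuggestionFilterer(Protocol):
    def run(self, suggestions: list[str]) -> list[str]: ...


@dataclass
class ReviewTool:
    """Hypothetical stand-in for CodeReviewTool; it depends only on the protocols."""

    summarizer: PatchSummarizer
    filterer: SuggestionFilterer

    def review(self, patch_id: str | int, patch_text: str) -> dict:
        # Normalize the identifier so callers may pass either str or int.
        normalized_id = str(patch_id)
        summary = self.summarizer.run(patch_text)
        # Placeholder suggestions; a real tool would ask an LLM here.
        suggestions = [f"Consider adding tests for: {summary}", ""]
        return {
            "patch_id": normalized_id,
            "summary": summary,
            "comments": self.filterer.run(suggestions),
        }


class FakeSummarizer:
    """Test double: uses the first line of the patch as its summary."""

    def run(self, patch_text: str) -> str:
        return patch_text.splitlines()[0]


class DropEmptyFilterer:
    """Test double: drops blank suggestions."""

    def run(self, suggestions: list[str]) -> list[str]:
        return [s for s in suggestions if s.strip()]


tool = ReviewTool(FakeSummarizer(), DropEmptyFilterer())
result = tool.review(12345, "Fix null check\n+ if x is None: return")
```

Because the fakes satisfy the protocols structurally, no inheritance from a shared base class is needed, which is what makes this style convenient for unit tests.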
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| scripts/code_review_tool_evaluator_report.py | Removed legacy evaluation report generator |
| scripts/code_review_tool_evaluator.py | Removed legacy evaluation script (613 lines) |
| experiments/review_helper_modify_filtering_step.ipy | Removed experimental filtering modification script |
| requirements.txt | Added weave>=0.50.0 for evaluation tracking |
| notebooks/code_review_evaluation.ipynb | New notebook for running W&B Weave evaluations |
| notebooks/code_review_create_dataset.ipynb | New notebook for creating evaluation datasets |
| bugbug/tools/suggestion_filtering/prompts.py | Extracted filtering prompts to dedicated module |
| bugbug/tools/suggestion_filtering/agent.py | New modular suggestion filtering tool |
| bugbug/tools/patch_summarization/prompts.py | Extracted summarization prompts to dedicated module |
| bugbug/tools/patch_summarization/agent.py | New modular patch summarization tool |
| bugbug/tools/comment_matching/prompts.py | New prompts for LLM-based comment matching |
| bugbug/tools/comment_matching/agent.py | New tool for matching generated vs ground truth comments |
| bugbug/tools/code_review/scorer.py | New Weave scorers for evaluation metrics |
| bugbug/tools/code_review/utils.py | Refactored to work with structured comment objects |
| bugbug/tools/code_review/prompts.py | Removed prompts moved to specialized modules |
| bugbug/tools/code_review/agent.py | Refactored to use Protocol-based dependencies |
| bugbug/tools/base.py | Simplified by removing version property and print method |
| bugbug/tools/core/platforms/base.py | Updated signature to accept str or int for patch_id |
| bugbug/tools/core/platforms/phabricator.py | Updated signature to accept str or int for patch_id |
| bugbug/tools/core/platforms/swarm.py | Updated signature to accept str or int for patch_id |
| bugbug/code_search/searchfox_api.py | Made get_file parameter optional with default implementation |
| bugbug/code_search/mozilla.py | Made get_file parameter optional with fallback |
Introduces a run_by_diff_id method that retrieves a patch by diff ID from review_data and runs the review process.
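A rough sketch of how such a method could work is shown below. Only the name `run_by_diff_id` and the idea of looking the patch up in `review_data` come from the commit message; `Patch`, `InMemoryReviewData`, `ReviewRunner`, `get_patch_by_id`, and the toy review logic are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Patch:
    diff_id: str
    raw_diff: str


@dataclass
class InMemoryReviewData:
    """Hypothetical stand-in for review_data: maps diff IDs to patches."""

    patches: dict[str, Patch] = field(default_factory=dict)

    def get_patch_by_id(self, diff_id: str | int) -> Patch:
        # Normalize to str so both str and int IDs resolve to the same patch.
        return self.patches[str(diff_id)]


@dataclass
class ReviewRunner:
    review_data: InMemoryReviewData

    def run(self, patch: Patch) -> list[str]:
        # Placeholder review: report added lines that contain "TODO".
        return [
            line[1:].strip()
            for line in patch.raw_diff.splitlines()
            if line.startswith("+") and "TODO" in line
        ]

    def run_by_diff_id(self, diff_id: str | int) -> list[str]:
        # Fetch the patch first, then delegate to the normal review path.
        patch = self.review_data.get_patch_by_id(diff_id)
        return self.run(patch)


data = InMemoryReviewData({"987": Patch("987", "+ x = 1  # TODO: remove\n- y = 2")})
runner = ReviewRunner(data)
comments = runner.run_by_diff_id(987)
```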
Eliminated the abstract version property from GenerativeModelTool and removed the version attribute from CodeReviewTool since it is not used.
Moved suggestion filtering logic from code_review/agent.py to a new suggestion_filtering module. Introduced SuggestionFilteringTool for filtering review comments, updated CodeReviewTool to use the new filterer, and relocated related prompt templates. This improves modularity and separation of concerns for suggestion filtering.
Wrapped comments and rejected examples in <comments-to-filter> and <rejected-examples> tags to improve prompt structure and clarity.
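The tag wrapping described above might look roughly like the following sketch. The tag names come from the commit message; the helper names (`wrap_in_tag`, `build_filter_prompt`), the instruction sentence, and the bullet formatting are assumptions, not the prompt templates from the PR.

```python
def wrap_in_tag(tag: str, items: list[str]) -> str:
    """Render items as a bullet list enclosed in an XML-style tag."""
    body = "\n".join(f"- {item}" for item in items)
    return f"<{tag}>\n{body}\n</{tag}>"


def build_filter_prompt(comments: list[str], rejected_examples: list[str]) -> str:
    """Assemble a filtering prompt with clearly delimited sections."""
    return "\n\n".join(
        [
            # Hypothetical instruction text, for illustration only.
            "Keep only the comments that are unlike the rejected examples.",
            wrap_in_tag("comments-to-filter", comments),
            wrap_in_tag("rejected-examples", rejected_examples),
        ]
    )


prompt = build_filter_prompt(
    comments=["Possible null dereference in parse()"],
    rejected_examples=["Nit: rename variable"],
)
```

Delimiting each section with its own tag lets the model tell the inputs apart even when individual comments contain blank lines or list markers of their own.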
It will be replaced with the W&B Weave evaluation pipeline.
7e1ca6b to 595cb8c
Collaborator
Did |
marco-c approved these changes on Jan 12, 2026
Member
Author
The main goal was to simplify the tracking. Now it is an independent tool, so we can evaluate it in isolation.
These improvements could be reviewed commit by commit.